
merge: resolve conflicts with SharpAI/SwiftLM main (DFlash integration) #1

Merged

ericjlake merged 72 commits into ericjlake:perf/combined from SharpAI:merge/eric-pr85-resolved on Apr 26, 2026

Conversation

@solderzzc

Hey Eric — we tried to push this directly to your perf/combined branch but the "Allow edits from maintainers" permission blocked it. So here's a PR instead!

This merges SharpAI/SwiftLM:main into your branch to resolve the three conflicts from our DFlash integration (PR SharpAI#78) that landed after you forked.

Conflict resolution

| File | Your change | Our main | Resolution |
| --- | --- | --- | --- |
| README.md | Qwen3-A3B perf table | DeepSeek-V4 perf table | Both kept |
| Server.swift save() | T-dim slice fix | MambaCache early return | Both kept — MambaCache guard first, then T-dim slice |
| Server.swift decision branch | Spec-decode first | skipPromptCache guard (kvBits) | Combined — spec-decode first, then skipPromptCache gate |

Once you merge this, your PR SharpAI#85 on SharpAI/SwiftLM will be conflict-free and we can land it. 🚀

0xClandestine and others added 30 commits April 21, 2026 13:50
Critical bug fix and performance optimizations for DFlash speculative
decoding. Acceptance rate improved from 25% to 89% (matching Python
reference), throughput from 6.7 to 42 tok/s.

Root cause: hiddenNorm was declared as a plain property without @ModuleInfo,
so its RMSNorm weight was never loaded from safetensors. The key
"hidden_norm.weight" didn't match the reflected key
"hiddenNorm.weight", leaving the weight at all-ones instead of
the trained values (~0.98). This single missing weight distorted
every draft prediction, compounding through all 5 draft layers.

Fix: Added @ModuleInfo(key: "hidden_norm") annotation, matching
the safetensors key. Also added @ModuleInfo for norm and fc for
consistency.
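The key-reflection mismatch can be illustrated with a toy loader (a Python sketch, not the real MLX Swift loader — the `key_map` dict here stands in for the `@ModuleInfo(key:)` annotation):

```python
# Toy loader illustrating the bug: a checkpoint key ("hidden_norm.weight")
# that differs from the reflected property name ("hiddenNorm.weight") is
# silently skipped, leaving the weight at its all-ones init.
checkpoint = {"hidden_norm.weight": [0.98, 0.97, 0.99]}  # trained values

class ToyModule:
    def __init__(self):
        self.params = {"hiddenNorm.weight": [1.0, 1.0, 1.0]}  # all-ones init
        self.key_map = {}  # reflected name -> checkpoint key override

    def load(self, ckpt):
        loaded = 0
        for name in self.params:
            key = self.key_map.get(name, name)
            if key in ckpt:                  # no match -> weight left untouched
                self.params[name] = ckpt[key]
                loaded += 1
        return loaded

m = ToyModule()
assert m.load(checkpoint) == 0               # reflected key misses: the bug
m.key_map["hiddenNorm.weight"] = "hidden_norm.weight"  # @ModuleInfo(key:)
assert m.load(checkpoint) == 1               # explicit mapping fixes it
assert m.params["hiddenNorm.weight"] == [0.98, 0.97, 0.99]
```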

Performance optimizations:
- Streaming: replaced generateSync + buffered array with
  generateStreaming + Continuation, yielding tokens immediately
- Draft prefetch: launch next cycle's draft with asyncEval before
  rollback, overlapping GPU work
- Batched asyncEval: changed blocking eval() to asyncEval() for
  verify logits and hidden states
- asyncEval(committedHidden): unblocks prefetch window
- Stop token Set: precomputed O(1) lookup
- Removed double fflush, added DFlashDumper call-site guards

Submodule updates:
- mlx-swift-lm: exactSmallProjPad for quantized linear at small
  seq_len (<16), DFlash protocols, open MambaCache/ArraysCache
- mlx-swift: remove stale .air kernel files

Benchmark (Qwen3.5-27B-4bit, thinking mode, 2048 tokens):
  41.9 tok/s, 89.4% acceptance, 216 cycles
… streaming

When SSD expert streaming is active, expert weight tensors (.weight) are
replaced with zero-filled placeholders of the correct shape/dtype during
loading. Only scales and biases are loaded into RAM — the actual expert
weight data is read from SSD at runtime via pread/mmap.

RAM savings for MoE models:
  - Qwen3.6-35B-A3B: 18.4 GB → 5.1 GB (73% reduction)
  - Expert weights skipped: 16.1 GB (weight only, not scales/biases)
  - Expert scales+biases loaded: ~2 GB (needed for dequantization)

Performance on Qwen3.6-35B-A3B (512 tokens, math prompt):
  - No SSD streaming:   11.5 tok/s,  18.4 GB RAM
  - SSD streaming only: 11.5 tok/s,   5.1 GB RAM
  - SSD + DFlash:       32.2 tok/s,   5.1 GB RAM
Both streaming and non-streaming chat/text completion responses now include
a 'timings' object with:
  - predicted_per_second: generation speed in tokens/second
  - predicted_n: number of completion tokens
  - predicted_ms: total generation wall-clock time in ms

This matches llama-server's timing convention and allows clients to see
generation speed directly from the API response without external measurement.
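The timings arithmetic can be sketched as follows (field names from the commit message; the request/response plumbing is assumed, not taken from Server.swift):

```python
# Sketch of the llama-server-style 'timings' object described above.
def make_timings(completion_tokens: int, generation_seconds: float) -> dict:
    ms = generation_seconds * 1000.0
    tps = completion_tokens / generation_seconds if generation_seconds > 0 else 0.0
    return {
        "predicted_n": completion_tokens,        # number of completion tokens
        "predicted_ms": ms,                      # wall-clock generation time
        "predicted_per_second": tps,             # tokens/second
    }

t = make_timings(512, 12.8)
assert t["predicted_n"] == 512
assert t["predicted_ms"] == 12800.0
assert round(t["predicted_per_second"], 1) == 40.0
```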
Tests 4 configurations for Qwen3.6-35B-A3B-4bit with same math prompt:
  - Baseline (no SSD, no DFlash)
  - SSD Streaming only
  - SSD Streaming + DFlash
  - DFlash only

Results (512 tokens, 3 runs each):
  Baseline:      26.3 tok/s,  18.8 GB RAM
  SSD Streaming: 12.5 tok/s,   5.4 GB RAM
  SSD + DFlash:  33.3 tok/s,   7.4 GB RAM  ← best tradeoff
  DFlash only:  125.4 tok/s,  20.0 GB RAM
- Add StreamableMoE conformance to Qwen3NextModelInner
- Add LayerPartitionable conformance to Qwen3NextModelInner
- Add DFlashTargetModel conformance to Qwen3NextModel
  - dflashEmbedTokens, dflashLmHeadLogits, dflashForwardWithCapture
  - dflashGatedDeltaForward with tape recording for GDN rollback
- Add dflashForwardWithTape to Qwen3NextGatedDeltaNet
- Add bridge file Qwen3Next+DFlash.swift
- Short prompt works: 68.8% acceptance, 9.8 GB RAM (vs 45 GB full load)
- Longer runs crash — likely Metal watchdog on 512-expert SSD reads
…b440)

- Bumps mlx-swift-lm submodule to b440 (tag) / 63707c0:
  fix(Gemma4Text): dispatch QuantizedKVCache correctly in LLM attention
  (merges PR #29, closes #71)

- Server.swift: expose `kv_bits` as a per-request API field
  (ChatCompletionRequest.kvBits -> GenerateParameters.kvBits)
  enabling native MLX QuantizedKVCache without a server restart.

- run_benchmark.sh: add Test 9 — QuantizedKVCache regression suite
  [1/4] kv_bits=4 short  [2/4] kv_bits=8 short
  [3/4] kv_bits=4 long (KV-sharing path)  [4/4] baseline

  Test 9 passed on mlx-community/gemma-4-26b-a4b-it-4bit.
README.md:
- Added '🔧 Per-Request API Parameters' section with kv_bits table,
  kv_bits vs --turbo-kv comparison table, and curl usage example
- Clarified --turbo-kv CLI entry: 'activates after 2048 tokens, server-wide'

Server.swift:
- Added kv_bits input validation (only nil/4/8 accepted; returns 400 otherwise)
- Bypass prompt cache restore when kv_bits is set (prevents unsafe mixing of
  QuantizedKVCache and KVCacheSimple states across requests)
- Bypass prompt cache save when kv_bits is set (same safety reason)
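A minimal sketch of the validation and cache-bypass rule (a hypothetical helper for illustration, not the actual Server.swift code):

```python
# kv_bits rule from the commit message: only absent (None), 4, or 8 are
# accepted; anything else yields HTTP 400. When kv_bits is set, prompt cache
# save/restore is bypassed to avoid mixing QuantizedKVCache and KVCacheSimple
# states across requests.
def validate_kv_bits(kv_bits):
    """Return (http_status, skip_prompt_cache)."""
    if kv_bits is None:
        return 200, False            # default path: prompt cache allowed
    if kv_bits in (4, 8):
        return 200, True             # quantized KV: bypass cache save/restore
    return 400, False                # reject unsupported bit widths

assert validate_kv_bits(None) == (200, False)
assert validate_kv_bits(4) == (200, True)
assert validate_kv_bits(8) == (200, True)
assert validate_kv_bits(6)[0] == 400
assert validate_kv_bits("8")[0] == 400   # wrong type also rejected
```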

run_benchmark.sh (Test 9):
- Corrected header comment to match actual assertions (removed false ≥20 token
  and multi-turn claims; stated actual ≥3 token / non-empty checks)
- Added explicit SERVER_READY flag + post-loop failure with log dump
- Widened thinking-block regex to handle both <|channel|>thought and <|channel>thought
- Replace 🧠 with 📡 heading emoji
- Rewrite as structured tables (Text / Vision / Audio) with all 50+ model
  families derived from the actual MLXLLM + MLXVLM model file inventory
- LLM table: Gemma, Qwen, Phi, Mistral, Llama, GLM, DeepSeek, Falcon,
  LFM2, OLMo, Granite, SmolLM3, InternLM2, Cohere, Jamba, Exaone, MiMo,
  Ernie, Baichuan, Bailing, NemotronH, Starcoder2, OpenELM, BitNet,
  MiniMax, Apertus/AfMoE, MiniCPM, Qwen3Next
- VLM table: Gemma4, Gemma3, Qwen3-VL, Qwen2-VL/2.5-VL, LFM2-VL,
  Pixtral, PaliGemma, Idefics3, Mistral3, FastVLM, SmolVLM2, GlmOcr, QwenVL
- ALM table: Gemma-4-e4b only (factually correct — Qwen2-Audio removed;
  it was never wired into the audio pipeline here)
fix: Gemma-4 QuantizedKVCache + kv_bits API + Test 9 (mlx-swift-lm b440)
… fix CORS/parallel test gaps

- Server.swift: add defer-based heartbeat cleanup in both handleChatStreaming and
  handleTextStreaming so heartbeatTask is always cancelled on any exit path
  (client disconnect during prefill no longer leaks the heartbeat task)
- ServerSSETests.swift: add missing import Foundation for Data/JSONSerialization
- test-server.sh Test 32: fail on empty curl response instead of false-passing
- test-server.sh Test 33: use conditional curl; fail if request fails entirely
- test-server.sh Test 34: redirect CORS preflight to CORS_PORT (--cors server)
  instead of the main server which has no CORS middleware
- test-server.sh Test 35: spin up a dedicated --parallel 2 server so concurrent
  requests actually overlap and stress the global hook under real parallelism
- test-opencode.sh: capture opencode exit code separately; classify parse errors
  vs acceptable non-zero exits to prevent false passes
…ch in Tests 32-33

The new conditional curl patterns in Tests 32 and 33 combined with the
existing set -euo pipefail caused the script to abort when grep found
no match (exit 1) in the EVENT_DATA pipeline. All grep/jq calls that
may produce no output now use || true or are wrapped in if/else to
prevent premature script exit.
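The interaction can be reproduced with a few lines of bash (a standalone demo, not an excerpt from the test scripts):

```shell
#!/usr/bin/env bash
# Under `set -euo pipefail`, a grep that finds no match exits 1 and aborts
# the script. Tolerating the no-match case with `|| true` keeps it alive.
set -euo pipefail

EVENT_DATA="data: hello"

# Without `|| true`, this line would kill the script (grep exits 1 on no match).
MATCHES=$(echo "$EVENT_DATA" | grep 'no-such-pattern' || true)

if [ -z "$MATCHES" ]; then
    echo "no match, script still running"
fi
```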
[codex] make OpenAI streaming strict by default
…experts are combined

Fixes #72: on a 16GB Mac Mini M4, adding --draft-model alongside --stream-experts
caused RAM to spike to the physical limit and trigger swap, even though the draft
model is only a 4B (~3.5GB) model.

Root causes and fixes:
1. [Bug] draftConfig.lazyLoad was never set — draft weights were eagerly paged into
   unified RAM. Fix: set draftConfig.lazyLoad = true when --stream-experts is active,
   mirroring what already happens for the main model config.

2. [Bug] Memory.cacheLimit / Memory.memoryLimit were applied after both model loads,
   so neither the main nor draft model loaded under a cache budget. Fix: apply the
   SSD memory cap immediately after ExpertStreamingConfig.shared.activate() — before
   any LLMModelFactory.loadContainer() calls — so both models respect the page-cache
   limit throughout loading.

3. [Bug] physicalBudget did not account for the draft model's resident footprint,
   leaving the cap 3–4 GB too high. Fix: profile the draft model directory before
   loading and subtract its weightMemoryGB from physicalBudget in all three affected
   strategy branches (swapAssisted, layerPartitioned, early cap). A 2 GB floor guard
   prevents the budget going negative on very constrained machines.

Expected result on 16GB M4:
- Draft model weights are mmap'd (lazy) — only accessed pages in RAM
- Both models load under the ~6GB effective page-cache budget (9.6GB - 3.5GB draft)
- No swap; total RAM stays within the SSD streaming budget
…draft model

- Extract computeSSDMemoryBudget() from inline formula so it can be unit tested
  without loading a real model or touching Memory.cacheLimit
- Wire all three budget call sites to use the extracted function (no behaviour change)
- Add SSDMemoryBudgetTests.swift with 8 tests covering:
    * Baseline 16 GB / no draft (formula correctness)
    * Issue #72 regression: 16 GB + 3.5 GB draft → budget reduced by exact footprint
    * Floor guard: deeply negative raw result clamped to 2 GB
    * Floor value: confirmed at exactly 2 GB
    * Default-arg == 0 (no silent reduction without a draft model)
    * Monotonicity: larger draft → smaller or equal budget
    * Typical fleet: 24 GB and 64 GB with 3.5 GB draft
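The extracted function's contract can be sketched as below. The 2 GB floor, the draft-footprint subtraction, and the test cases come from this commit; the 60%-of-RAM base is an assumption inferred from the "~6GB effective page-cache budget (9.6GB - 3.5GB draft)" figure on a 16 GB machine, not the shipped formula:

```python
# Hypothetical reconstruction of computeSSDMemoryBudget().
GIB = 1 << 30
FLOOR_BYTES = 2 * GIB

def compute_ssd_memory_budget(physical_ram_bytes: int,
                              draft_footprint_bytes: int = 0) -> int:
    base = int(physical_ram_bytes * 0.6)     # assumed base page-cache budget
    budget = base - draft_footprint_bytes    # issue #72: reserve draft RAM
    return max(budget, FLOOR_BYTES)          # floor guard: never below 2 GB

ram16 = 16 * GIB
draft = int(3.5 * GIB)
# Default arg == 0: no silent reduction without a draft model.
assert compute_ssd_memory_budget(ram16) == int(ram16 * 0.6)
# Issue #72 regression: budget reduced by the exact draft footprint.
assert compute_ssd_memory_budget(ram16, draft) == int(ram16 * 0.6) - draft
# Floor guard: deeply negative raw result clamps to exactly 2 GB.
assert compute_ssd_memory_budget(4 * GIB, 100 * GIB) == FLOOR_BYTES
# Monotonicity: a larger draft never yields a larger budget.
assert (compute_ssd_memory_budget(ram16, draft + GIB)
        <= compute_ssd_memory_budget(ram16, draft))
```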
Two correctness issues flagged in inline review:

1. GiB/GB unit mismatch — weightMemoryGB is computed as bytes/1e9 (decimal GB),
   but was multiplied back to bytes using 1_073_741_824 (GiB), causing ~7% budget
   drift. Fix: use draftProfile.weightFileSizeBytes directly (exact bytes, no
   conversion needed).

2. Repeated ModelProfiler.profile() filesystem walks — the draft model directory
   was enumerated once in the early cap block and again in each strategy branch
   (swapAssisted, layerPartitioned). Fix: compute draftFootprintBytes once before
   the streamExperts block and reuse it everywhere.
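The ~7% unit drift in issue 1 is easy to quantify — dividing bytes by 1e9 (decimal GB) and multiplying back by 2^30 (a GiB) inflates the value by exactly 2^30 / 1e9:

```python
# Quantifying the GiB/GB mismatch flagged in review.
weight_bytes = 3_500_000_000                 # example draft weight file size
gb = weight_bytes / 1e9                      # decimal GB, as weightMemoryGB did
round_tripped = gb * 1_073_741_824           # wrong: GB value * GiB multiplier
drift = round_tripped / weight_bytes - 1.0
assert abs(drift - (2**30 / 1e9 - 1.0)) < 1e-12
assert 0.07 < drift < 0.08                   # ~7.4% overestimate
# The fix carries exact bytes end-to-end, so no conversion and no drift.
```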

Also addresses a third Copilot comment: the early SSD cap was only applied when
modelDirectory != nil, so first-run downloads were unprotected. Now the cap is
applied whenever --stream-experts is set, even if the model isn't cached yet
(handled via the else-if branch).

All 8 SSDMemoryBudgetTests still pass.
fix(ssd-stream): prevent RAM explosion when --draft-model + --stream-experts combined (#72)
…odel (#72 follow-up)

Reporter confirmed the original fix addressed load-time RAM, but swap still
explodes during inference: OS_RAM=20.7GB / MEM_DEMAND=40.2GB on a 16GB machine.

Root cause (inference-time):
The 200GB memoryLimit sentinel is necessary for SSD streaming alone — it bypasses
MLX eval_impl's spin-wait loop when expert pages are evicted mid-graph.  However,
with speculative decoding the draft model (4B / 3GB) and main model (35B / 20GB)
alternate forward passes in tight succession.  Both models' expert pages are
demanded within the same inference cycle, combined demand ~23GB >> 16GB physical.
The 200GB sentinel provides zero back-pressure, so macOS swaps aggressively
(10+ GB observed in Activity Monitor).

Fix:
When --stream-experts + --draft-model are both set AND combinedFootprint > 70%
of physical RAM, lower memoryLimit from 200GB to physicalRAM × 1.1.  This forces
MLX to hit its hard limit sooner and evict stale expert pages more aggressively
rather than extending into swap.  A clear startup warning is also printed:

  ⚠️  SSD + draft-model RAM pressure warning:
     Main model: 20.4GB  Draft: 3.0GB  Combined: 23.4GB  Physical RAM: 16.0GB
     Speculative decoding alternates both models' forward passes.
     On this machine the combined weight exceeds physical RAM,
     causing page-cache thrashing and swap during inference.
     → Recommendation: remove --draft-model on this machine,
       or use a smaller draft model whose weights fit in
       remaining RAM after the main model's page budget (6GB).
     Memory limit set to 17GB (tight cap for MLX eviction pressure)

When combined footprint fits in RAM (e.g. smaller draft on a 32GB machine),
the 200GB sentinel is still used as before — no regression for capable hardware.
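The sentinel-selection rule reads roughly like this (a hypothetical reconstruction with illustrative names; the 70% threshold, the 1.1x tight cap, and the 200 GB sentinel come from the commit message):

```python
# Choose between the 200 GB memoryLimit sentinel (no back-pressure) and a
# tight cap of physicalRAM * 1.1 when SSD streaming + draft model combine
# to exceed 70% of physical RAM.
GIB = 1 << 30
SENTINEL = 200 * GIB

def select_memory_limit(physical_bytes, main_bytes, draft_bytes,
                        stream_experts=True, has_draft=True):
    combined = main_bytes + draft_bytes
    if stream_experts and has_draft and combined > 0.7 * physical_bytes:
        return int(physical_bytes * 1.1)   # tight cap: force MLX eviction
    return SENTINEL                        # capable hardware: sentinel kept

# 16 GB machine, 20.4 GB main + 3.0 GB draft -> tight cap (~17.6 GB).
tight = select_memory_limit(16 * GIB, int(20.4 * GIB), 3 * GIB)
assert tight == int(16 * GIB * 1.1)
# 64 GB machine, same models: combined 23.4 GB < 70% of 64 GB -> sentinel.
assert select_memory_limit(64 * GIB, int(20.4 * GIB), 3 * GIB) == SENTINEL
```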
…el cache

Replace if-branch masking with metal::select for zero warp-divergence state
updates. Reorganize KernelCache from 8 flat named vars to tapeReplay[vec][msk]
and gatedDeltaTape[vec][msk] 2D arrays. Simplify dispatch call sites to
one-liner index lookups. Minor whitespace cleanup in DFlashIntermediateDumper.
… property

Add MambaSnapshotCache: lightweight O(1) snapshot-based rollback (lazy
reference capture, no GPU copy) as an alternative to RecurrentRollbackCache's
innovation-tape replay. Add dflashUseTapeRollback Bool to DFlashTargetModel
(default true) so models can opt in to either strategy. Update makeTargetCache
and arm/rollback helpers with clearer comments.

Also switch RecurrentRollbackCache.armRollback to lazy reference capture
(removes unnecessary MLX.contiguous copies on arm path).
Add DFlashKernelBench executable for isolated kernel timing. Exclude
DFlashKernelsOptimized.swift from the DFlash library target (work-in-progress
alternative kernel implementations kept for reference).
…next.sh

bench_35b.sh: save per-run raw response JSON, extract structured results into
bench_results.json (tok/s, RAM, timing per config) for downstream tooling.
Use slug variable consistently for log file naming.

Add bench_coder_next.sh for benchmarking Qwen3-Coder-Next model variants.
Move comparison tests from tests/DFlashComparison/ to tests/DFlash/, adding
DFlashBenchmark.swift, DFlashProfiler.swift, updated cosine similarity
comparison tools, and a README. Update .gitignore intermediates path.
…-draft-model (#72)

Git history audit (mlx-swift-lm):
  e6ba580 - 8.5x speedup (0.58→4.95 tok/s) from cross-projection batching (Eric Lake, M1 Ultra)
  2c71c6c - ssd-opt-v2: +4% more via persistent expert buffers (asyncEval warm path)
  2b1c653 - PAPPS N+1 prefetch permanently disabled (hurt Apple-native TPS)

README (line 245) explicitly states:
  'Speculative decoding is counterproductive for SSD-streaming MoE specifically.
   The verify pass sends N+1 tokens, each routing to *different* experts — SSD I/O
   scales with the *union* of all positions' expert selections.'

Strategy (not a hard error):
When --stream-experts + --draft-model are combined:
  - Auto-cap --num-draft-tokens to 1 (verify pass = 2 positions, not N+1)
  - At 1 draft token: fan-out is 2× SSD I/O (vs 5× at default 4 tokens)
  - If acceptance rate ≥ 50% (typical for same-family models), net TPS is positive
  - Print a clear advisory so users understand the tradeoff
  - Persistent expert buffers (~5 GB warm path, ssd-opt-v2) are PRESERVED —
    no regression to Eric Lake's M1 Ultra benchmark
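The fan-out arithmetic and the auto-cap condition above can be sketched as (illustrative helpers, not the shipped Swift code):

```python
# The verify pass evaluates N draft tokens plus 1 bonus position, so SSD
# expert I/O scales roughly with N + 1 relative to plain decoding.
def verify_fanout(num_draft_tokens: int) -> int:
    return num_draft_tokens + 1

assert verify_fanout(4) == 5     # default: 5x SSD I/O
assert verify_fanout(1) == 2     # auto-capped: 2x SSD I/O

# Cap fires only when SSD streaming AND a draft model are combined AND the
# requested draft-token count exceeds 1.
def auto_cap(stream_experts, has_draft, num_draft_tokens):
    if stream_experts and has_draft and num_draft_tokens > 1:
        return 1
    return num_draft_tokens

assert auto_cap(True, True, 4) == 1      # combined mode: capped to 1
assert auto_cap(True, False, 4) == 4     # solo SSD streaming: untouched
assert auto_cap(False, True, 4) == 4     # pure RAM spec decoding: untouched
```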

What is NOT changed:
  - SwitchLayers.swift warm path: untouched (idx.size <= 32 guard intact)
  - ExpertStreamingConfig: no new flags added (reverted failed hasDraftModel attempt)
  - computeSSDMemoryBudget() + cacheLimit logic from load-time fix: intact
  - Tight memoryLimit sentinel (physicalRAM × 1.1) when combined > 70% RAM: intact

Test coverage (18 tests, 0 failures):
  SSDDraftStrategyTests (10 new):
    - Fan-out arithmetic: 4 draft tokens → 5× I/O, 1 token → 2× I/O
    - Auto-cap fires only when streamExperts + draftModel + numDraftTokens > 1
    - Auto-cap does NOT fire for solo SSD streaming or pure RAM speculative decoding
    - Net throughput model: 70% acceptance at 2× fan-out is net positive
    - memoryLimit sentinel selection: tight cap on 16 GB, sentinel on 64 GB
  SSDMemoryBudgetTests (8 existing): all pass, no regressions
…sion

Three-check E2E test for the --stream-experts + --draft-model fix:

  [1/3] Auto-cap guard: verifies server log contains the 'auto-capping'
        warning, proving numDraftTokens was reduced from 4 to 1 at startup

  [2/3] RAM guard: measures vm_stat peak RAM during inference and fails
        if it exceeds 80% of physical RAM (the indicator that exposed the
        original swap explosion on reporter's 16GB M4 Mini)

  [3/3] Inference: verifies the combination still produces valid content
        (not crashed/empty), proving functional correctness

Uses small models (Qwen3.5-4B main + Qwen3.5-0.8B draft) — same
parameter-class proportions as the reporter's 35B+4B scenario but
runnable on any machine without 35B weights.

Run: ./run_benchmark.sh → option 10
github-actions Bot and others added 22 commits April 23, 2026 15:43
Prompt cache save/restore was incorrectly applied to Qwen3Next which
uses a hybrid KVCache+MambaCache architecture. MambaCache RNN states
cannot be arbitrarily trimmed or replayed at arbitrary token boundaries
unlike KVCacheSimple, so attempting to restore a partial match would
corrupt the linear attention state and cause spurious 1-token outputs.

Fix: PromptCache.save() and PromptCache.restore() now skip immediately
if any layer in the cache is a MambaCache instance.

Also fixes run_benchmark.sh Test 0 (automated matrix) to pass MODEL
via environment variable instead of feeding it through stdin, so the
model selection prompt is correctly bypassed when MODEL is pre-set.
Replacing the stdin pipe approach with an env var so child invocations
from Test 0's automated matrix loop skip the interactive menu entirely.
The previous echo-pipe was consumed by the 'read suite_opt' prompt but
any subsequent reads (model selection) had no input, causing the script
to fall through to option 3 by default.
When SUITE_OPT is set (automated matrix mode), skip all menu echoes
and the read prompt entirely. Child processes now run silently with
only test-relevant output.
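A toy menu script reproduces the starved-read fallback and shows why the env-var path is robust (illustrative only, not the actual run_benchmark.sh):

```shell
#!/usr/bin/env bash
# A single piped line feeds only the first `read`; the second `read` hits
# EOF and falls back to a default — the echo-pipe bug described above.
# A pre-set environment variable bypasses both prompts reliably.
set -euo pipefail

menu() {
    if [ -n "${MODEL:-}" ]; then
        echo "model=$MODEL"                       # env var path: no prompts
        return
    fi
    read -r suite_opt
    read -r model || model="option-3-default"     # starved read -> default
    echo "model=$model"
}

OUT=$(echo "1" | menu)                   # pipe feeds only the first read
[ "$OUT" = "model=option-3-default" ]

OUT=$(MODEL=qwen menu </dev/null)        # env var skips the menu entirely
[ "$OUT" = "model=qwen" ]
echo "ok"
```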
Both test-speculative.sh and test-dflash.sh grep for 'Using speculative
decoding' in the server log to confirm the speculative path was activated.
This string was never emitted — the tests were checking a log line that
didn't exist, causing speculative-decoding and dflash-speculative-decoding
CI jobs to always fail on Test 1.

Fix: emit the exact expected log line:
  - Standard spec: after draft model is loaded successfully
  - DFlash spec: at generation dispatch in Server.swift

Server log now contains all strings the tests grep for:
  ✅ 'Draft model loaded successfully'
  ✅ 'Using speculative decoding'
  ✅ 'speculative decoding' (for test-speculative-eval.sh)
test-dflash.sh grepped for:
  1. 'Draft model loaded successfully' — only emitted by standard draft path,
     not DFlash path which has its own 'DFlash draft model loaded' message
  2. 'Using speculative decoding' — not emitted by DFlash path at all
  3. 'speculative decoding' — was present but test was failing on (1)

Add both required lines immediately after DFlash draft model weights load,
mirroring the standard speculative decoding path. The streaming failures
('missing [DONE] sentinel') were downstream of the model-not-found state
caused by the load log mismatch, not an inference bug.
Adds Sources/SwiftLM/{Qwen3,Qwen3MoE,Llama}+DFlash.swift — each
declares the DFlashTargetModel protocol conformance and delegates to
the model's public callCapturing / embedTokens / lmHead
(now on *ModelInner via mlx-swift-lm b453).

Coverage:
  Qwen3Model      → Qwen3-8B and similar dense Qwen3 variants
  Qwen3MoEModel   → Qwen3-Coder-30B-A3B and other Qwen3 MoE variants
  LlamaModel      → Meta-Llama-3.x, Mistral, and Llama-family models
  Qwen35MoEModel  → already covered via Qwen35Model inheritance
  Qwen36MoE       → no separate Swift class found; uses Qwen35MoE path

Co-authored-by: clandestine.eth <96172957+0xClandestine@users.noreply.github.com>
Gemma4 omni (5.2GB) on a 7.5GB runner is tight. After other CI jobs
have run and filled the model cache, available RAM can drop below the
threshold needed for stable Metal command buffer execution, causing
sporadic GPU timeout crashes (kIOGPUCommandBufferCallbackErrorTimeout).

Add a vm_stat-based preflight check: if available+inactive RAM < 2.5GB,
exit 0 (skip) instead of crashing the whole run.
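The preflight parsing can be sketched in Python (the actual check is shell-based; the sample text below is illustrative, though it follows vm_stat's usual "page size of N bytes" header and "Pages free/inactive: N." line format):

```python
# Sum free + inactive pages from vm_stat-style output and compare against
# the 2.5 GB threshold from the commit message.
import re

SAMPLE = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                               80000.
Pages active:                            300000.
Pages inactive:                          100000.
"""

def available_bytes(vm_stat_text: str) -> int:
    page_size = int(re.search(r"page size of (\d+) bytes", vm_stat_text).group(1))
    def pages(name):
        m = re.search(rf"Pages {name}:\s+(\d+)\.", vm_stat_text)
        return int(m.group(1)) if m else 0
    return (pages("free") + pages("inactive")) * page_size

avail = available_bytes(SAMPLE)
assert avail == (80000 + 100000) * 16384
THRESHOLD = int(2.5 * (1 << 30))
should_skip = avail < THRESHOLD          # exit 0 (skip) instead of crashing
assert should_skip is False              # ~2.95 GB available in this sample
```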
Own DeepSeek V3 (deepseek_v3 / kimi_k25) and Kimi Linear (kimi_linear)
model implementations directly in SwiftLM so DFlashTargetModel conformance
is available without any upstream submodule changes.

- DeepseekV3DFlash.swift: full DSV3Config + model with callCapturing
- KimiLinearDFlash.swift: hybrid KDA/MLA Kimi 2.6 model with DFlash
- DFlashModelRegistry.swift: registers all three model types via
  LLMTypeRegistry.shared.registerModelType() at startup
- Server.swift: call registerDFlashModelTypes() before model loading
Use @ModuleInfo(key: "model") on the inner model property so weights
at model.* paths are found correctly. Also use @ModuleInfo(key: "norm")
for norm layers initialized in init() so their weights are tracked.
… limit

DeepseekV3DFlash.sanitize():
- Strip 'language_model.' wrapper prefix present in kimi_k25 and some
  other HuggingFace exports so weight keys resolve to model.* paths
- After stacking per-expert weights into switch_mlp, remove the original
  experts.N.* keys to prevent verify: .noUnusedKeys crash
- Generalize layer filter to use numHiddenLayers instead of hardcoded 61

Server.run():
- Raise RLIMIT_NOFILE to 4096 at startup; large sharded models (kimi_k25
  has 182 safetensor shards) exhaust the default macOS limit of 256
- Move MLX_MAX_OPS_PER_BUFFER=50 to top of run() before Metal init
- Enable --stream-experts automatically on <12GB machines in test-dflash.sh
  so weights are paged via mmap/pread instead of macOS VM swap
- Auto-cap draft tokens to 1 under SSD streaming (minimal fan-out)
- Always compute draftFootprintBytes regardless of --stream-experts flag
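The RLIMIT_NOFILE bump can be shown with Python's resource module (Server.run() presumably does the equivalent setrlimit call natively); 182 safetensor shards plus sockets easily exhaust macOS's default soft limit of 256:

```python
# Raise the soft open-file limit toward a target, never exceeding the hard
# limit (only root can raise the hard limit itself).
import resource

def raise_nofile_limit(target: int = 4096) -> int:
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = min(max(soft, target), hard)   # clamp to the hard limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft

new_soft = raise_nofile_limit(4096)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
assert soft == new_soft
assert soft <= hard
```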
* feat: bump mlx-swift-lm submodule for DeepSeek-V4 support

Points mlx-swift-lm to feat/deepseek-v4 branch (SharpAI/mlx-swift-lm#33)
which adds DeepseekV4.swift and registers the deepseek_v4 model type.

* feat: DeepSeek-V4-Flash benchmark results + profiler improvements

- README: add DeepSeek-V4-Flash (126GB Q3) benchmark table for M5 Pro 64GB
  SSD+TurboQuant delivers 4.16 tok/s at 40K context (13x vs plain SSD Stream)
- profile_runner.py: track peak GPU InUse via background polling thread (0.5s)
  instead of single post-generation snapshot; rename gpu_in_use → gpu_in_use_peak
  throughout; add separate GPU_InUse peak visualization section
- run_benchmark.sh: add Thump604/DeepSeek-V4-Flash-MLX-Q3-mixed-gs128-affine
  to Test 1 model list (option 11)
- mlx-swift-lm: bump submodule to 8a8da29 (attn_sink dtype fix)

* chore: bump mlx-swift-lm submodule to b463 (DeepSeek-V4 merged to main)
feat: add DFlash speculative decoding
Merges ericjlake's prompt-cache fixes from PR #85, resolving conflicts
with the DFlash integration (PR #78).

Changes from ericjlake:
- MambaCache safety gate + KVCacheSimple T-dim slice in save()
- ndim >= 3 guard in minCachedSeqLen scan
- Spec-decode short-circuit ordering (check before cache restore)
- README: Qwen3-A3B full-RAM perf table (M1 Ultra 64 GB)

Conflict resolution:
- README.md: kept both Qwen3-A3B and DeepSeek-V4 perf tables
- Server.swift save(): kept existing MambaCache early return + new T-dim slice
- Server.swift decision branch: combined spec-decode-first + skipPromptCache (kvBits)

Closes #84.
Co-authored-by: Eric Lake <ericjlake@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

This PR resolves post-fork conflicts by merging SharpAI/SwiftLM:main into the target branch while preserving DFlash integration changes, and expands test/CI coverage around SSE streaming strictness, OpenAI SDK compatibility, and DFlash/SSD+draft regressions.

Changes:

  • Adds a new DFlash SwiftPM library/module (runtime, draft model, rollback caches, model bridges) and related benchmarking/profiling utilities.
  • Extends integration + unit tests for SSE strict streaming / opt-in heartbeat, OpenAI SDK parsing compatibility, DFlash E2E, and SSD+draft memory guard (Issue SharpAI#72).
  • Updates profiling scripts/docs and CI workflows to run the new test suites and regression jobs.

Reviewed changes

Copilot reviewed 44 out of 46 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| tests/test-server.sh | Adds SSE strict-streaming and opt-in heartbeat/CORS/concurrency integration tests. |
| tests/test-opencode.sh | New OpenAI SDK + OpenCode CLI compatibility integration test. |
| tests/test-dflash.sh | New DFlash speculative decoding E2E test script (dual-model, streaming, stability). |
| tests/SwiftLMTests/ServerSSETests.swift | New XCTest coverage for SSE prefill chunks, header parsing, and PrefillState invariants. |
| tests/SwiftLMTests/SSDPersistentBufferGuardTests.swift | New regression tests for SSD streaming + draft auto-cap/memory-limit strategy (Issue SharpAI#72). |
| tests/DFlash/dump_python_intermediates.py | Python reference dumper for DFlash intermediate tensors (.npy + meta). |
| tests/DFlash/compare_swift_python.py | Compares Python vs Swift dumps via cosine similarity to localize divergence. |
| tests/DFlash/compare_cosine.py | Python self-consistency + "Swift-equivalent" path comparison helper. |
| tests/DFlash/README.md | Documentation for DFlash benchmarking/profiling/comparison tooling. |
| tests/DFlash/DFlashProfiler.swift | Swift micro-profiler for kernel performance and numerical consistency. |
| tests/DFlash/DFlashCosSimComparison.swift | Swift-side comparison tool scaffolding for Python ↔ Swift intermediates. |
| scripts/profiling/profile_runner.py | Adds background polling to capture peak GPU "in use" memory during requests. |
| scripts/profiling/bench_coder_next.sh | Adds benchmark runner for Qwen3-Coder-Next across SSD/DFlash configs. |
| scripts/profiling/bench_35b.sh | Adds benchmark runner + JSON extraction for 35B DFlash/SSD configs. |
| run_benchmark.sh | Adds headless invocation support, new test menu entries, and Issue SharpAI#71/SharpAI#72 regression suites. |
| docs/profiling/profiling_results_simbas-MacBook-Pro.md | Updates recorded profiling results/table schema for new GPU peak metric. |
| Sources/SwiftLM/Qwen3Next+DFlash.swift | Adds DFlashTargetModel bridge for Qwen3Next models. |
| Sources/SwiftLM/Qwen3MoE+DFlash.swift | Adds DFlashTargetModel bridge for Qwen3 MoE models. |
| Sources/SwiftLM/Qwen35+DFlash.swift | Adds DFlashTargetModel conformance bridges for Qwen3.5 models. |
| Sources/SwiftLM/Qwen3+DFlash.swift | Adds DFlashTargetModel bridge for Qwen3 dense models. |
| Sources/SwiftLM/Llama+DFlash.swift | Adds DFlashTargetModel bridge for Llama/Mistral-style models. |
| Sources/SwiftLM/DeepseekV3DFlash.swift | Adds SwiftLM-owned DeepSeek V3 model implementation with DFlash support. |
| Sources/SwiftLM/DFlashModelRegistry.swift | Registers SwiftLM-owned DFlash-capable model types in the global registry. |
| Sources/SwiftLM/ModelProfiler.swift | Accounts for draft-model weight bytes and adjusts swap-assisted memoryLimit sentinel. |
| Sources/DFlash/RecurrentRollbackCache.swift | Adds rollback-capable MambaCache subclasses (tape replay + snapshot rollback). |
| Sources/DFlash/DFlashRuntime.swift | Core DFlash runtime: prefill, draft/verify loop, accept/reject, rollback, event streaming. |
| Sources/DFlash/DFlashKernelProvider.swift | Adds global provider registry for specialized DFlash kernels. |
| Sources/DFlash/DFlashEngine.swift | Adds engine abstraction for verify/rollback strategies (full-attn vs hybrid-GDN). |
| Sources/DFlash/DFlashDraftRegistry.swift | Maps known target model names to draft model IDs (auto-resolution). |
| Sources/DFlash/DFlashDraftModel.swift | Implements the DFlash block-diffusion draft model + context-only KV cache. |
| Sources/DFlash/DFlashDraftBackend.swift | Implements greedy draft-token generation backend using target embed/lm_head. |
| Sources/DFlash/DFlashIntermediateDumper.swift | Adds .npy dump utility for Swift intermediates to compare with Python reference. |
| README.md | Updates supported models/methodologies and adds DeepSeek V4 Flash profiling results + notes. |
| Package.swift | Adds DFlash library, DFlashKernelBench executable, and SwiftLMTests test target. |
| Package.resolved | Updates dependency lock revisions/versions. |
| .gitignore | Ignores generated DFlash intermediates directory. |
| .github/workflows/ci.yml | Runs new SwiftLMTests, adds opencode modality, and introduces DFlash + Issue SharpAI#72 guard jobs. |
| .agents/workflows/review-github-pr.md | Adds/updates internal workflow guidance for reviewing SharpAI/SwiftLM PRs. |


Comment on lines +534 to +536

```swift
print("[DFlash] Cycle \(cyclesCompleted + 1): blockLen=\(blockLen), verifyLen=\(verifyTokenIDs.dim(0)), acceptanceLen=\(acceptanceLen), commitCount=\(1 + acceptanceLen)")
fflush(stdout)
```

```swift
/// Snapshot of the cache state before the verify pass.
private var snapshotState: [MLXArray?]?

public init(convKernelSize: Int = 4) {
```
Comment thread tests/test-server.sh
Comment on lines +976 to +983

```bash
if [ -z "$STRICT_STREAM" ] || ! echo "$STRICT_STREAM" | grep -q 'data: \[DONE\]'; then
    # Only fail if it was a curl failure (empty), not a missing event
    [ -z "$STRICT_STREAM" ] && fail "Strict mode: stream was empty"
elif echo "$STRICT_STREAM" | grep -q "^event:"; then
    fail "Strict mode: unexpected named SSE event without opt-in header"
else
    pass "Strict mode: no named SSE events in default streaming"
fi
```
Comment on lines +370 to +378

```swift
if targetHidden == nil {
    targetHidden = MLXArray.zeros(
        [feat.dim(0), promptLen, feat.dim(-1)],
        dtype: feat.dtype
    )
}
targetHidden![0..., chunkStart ..< chunkEnd, 0...] = feat
eval(targetHidden!)
```

```swift
/// Registry to allow models to use DFlash kernels without module circular dependencies.
public struct DFlashKernelRegistry: Sendable {
    public nonisolated(unsafe) static var provider: DFlashKernelProvider? = nil
```
Comment thread tests/DFlash/README.md
Comment on lines +113 to +136
### 3. DFlashCosSimComparison.swift
Compares intermediate values between Python and Swift implementations.

**Usage:**
```bash
swift run DFlashCompare --dir tests/DFlashComparison/intermediates
```

## Python Comparison

The benchmark format is compatible with `dflash-mlx/benchmark/` results:
- Same JSON structure
- Same metrics (TPS, TTFT, acceptance ratio)
- Same hardware info collection

You can compare Swift vs Python results by loading both JSON files and comparing the `summary` sections.

## Results Directory

Create a `results/` directory here or specify custom output paths:
```bash
mkdir -p tests/DFlashComparison/results
swift run DFlashBenchmark --output tests/DFlashComparison/results/benchmark.json
```
Comment thread tests/test-opencode.sh
Comment on lines +123 to +127

```bash
log "Installing opencode-ai in isolated directory..."
mkdir -p /tmp/opencode_cli_test
cd /tmp/opencode_cli_test
npm install opencode-ai@latest --silent >/dev/null 2>&1
```
Comment thread tests/test-dflash.sh
Comment on lines +2 to +13

```bash
# test-speculative.sh — Speculative decoding E2E verification
#
# Uses a small draft model (Qwen3.5-0.8B) to accelerate a larger main model
# (Qwen3.5-4B) via speculative decoding. Verifies:
#   1. Dual-model loading (draft + main)
#   2. Speculative decoding path activation
#   3. Correct token generation
#   4. Server stability under dual-model memory pressure
#
# Usage:
#   ./tests/test-speculative.sh [binary_path] [port]
#
```
Comment on lines +250 to +254

```swift
) -> AsyncStream<DFlashEvent> {
    // Streaming: yield events from inside the generation loop
    // via a Continuation, avoiding the buffered-array bottleneck.
    AsyncStream(bufferingPolicy: .unbounded) { continuation in
        let task = Task {
```
@solderzzc
Author

Heads up — the diff looks massive (~10k lines) but that's just because it's merging our current main into your branch, which includes @0xClandestine's entire DFlash integration (PR SharpAI#78, ~6k lines of new code) that landed after you forked. All of that is already reviewed and merged on our side.

The only thing that needs your attention is the conflict resolution in 2 files:

  1. README.md — kept both your Qwen3-A3B perf table and the DeepSeek-V4 table from main
  2. Server.swift — three merge points:
    • save(): your T-dim slice fix placed after our existing MambaCache early return guard (both kept)
    • restore(): your recurrent-layer safety gate and ndim >= 3 guard added cleanly (no conflict, auto-merged)
    • Decision branch: your spec-decode-first ordering combined with our skipPromptCache guard (which adds kvBits safety on top of isMultimodalRequest)

Once you merge this, your PR SharpAI#85 on SharpAI/SwiftLM will be conflict-free. 👍

@ericjlake ericjlake merged commit 53b040d into ericjlake:perf/combined on Apr 26, 2026
3 of 4 checks passed